Attention-Based LSTM with Multi-Task Learning for Distant Speech Recognition
Authors
Abstract
Distant speech recognition is a highly challenging task due to background noise, reverberation, and overlapping speech. Attention mechanisms have recently received growing interest. In this paper, we explore an attention mechanism embedded within a long short-term memory (LSTM) based acoustic model for large vocabulary distant speech recognition, trained on speech recorded with a single distant microphone (SDM) and with multiple distant microphones (MDM). Furthermore, a multi-task learning architecture is incorporated to improve robustness: the network is trained to perform both a primary senone classification task and a secondary feature enhancement task. Experiments were conducted on the AMI meeting corpus. On average, our model achieved 3.3% and 5.0% relative improvements in word error rate (WER) over the LSTM baseline model in the SDM and MDM cases, respectively. In addition, on the MDM task the model reduced WER by 2-4% absolute compared with a conventional pipeline of independent processing stages.
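The abstract combines two ideas: attention over LSTM outputs, and a multi-task objective that pairs senone classification with feature enhancement. Below is a minimal pure-Python sketch of both. It is not the authors' implementation, and it rests on several assumptions:

- Scoring is dot-product.
- The query is a single vector, such as the current frame's LSTM output.
- The enhancement target is scored with mean squared error against clean reference features. For AMI these might be close-talk features, but that is also an assumption.
- `aux_weight` is a hypothetical interpolation weight.

The paper's exact attention formulation and loss weighting may differ. A small helper also shows how a relative WER improvement differs from an absolute one.

```python
import math


def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(hidden_states, query):
    """Dot-product attention over a window of LSTM outputs (assumed scoring).

    hidden_states: list of T vectors (lists of floats), e.g. LSTM outputs
        for frames around the current one.
    query: vector of the same size (hypothetical choice: the current
        frame's LSTM output).
    Returns (weights, context): softmax weights over the T frames and the
    weighted sum of hidden_states.
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context


def cross_entropy(probs, target_index):
    # Primary task: negative log-likelihood of the reference senone.
    return -math.log(probs[target_index])


def mean_squared_error(pred, ref):
    # Secondary task: distance between enhanced and clean reference features.
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def multitask_loss(senone_logits, senone_target, enhanced, clean, aux_weight=0.1):
    # aux_weight is a hypothetical weight, not a value taken from the paper.
    primary = cross_entropy(softmax(senone_logits), senone_target)
    secondary = mean_squared_error(enhanced, clean)
    return primary + aux_weight * secondary


def relative_reduction(baseline_wer, model_wer):
    # Relative improvement in percent: (baseline - model) / baseline * 100.
    return (baseline_wer - model_wer) / baseline_wer * 100.0


if __name__ == "__main__":
    w, ctx = attention_context([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
    print("attention weights:", w, "context:", ctx)
    print("loss:", multitask_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 0.0], aux_weight=0.5))
    # Hypothetical WERs, not results from the paper.
    print("relative reduction (%):", relative_reduction(50.0, 48.35))
```

The helper makes the relative/absolute distinction concrete with hypothetical numbers. If a baseline WER were 50.0% and a model reached 48.35%, that is a 1.65-point absolute reduction but a 3.3% relative one. The relative and absolute figures in the abstract are therefore not directly comparable.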
Similar articles
End-to-end attention-based distant speech recognition with Highway LSTM
End-to-end attention-based models have been shown to be competitive alternatives to conventional DNN-HMM models in speech recognition systems. In this paper, we extend existing end-to-end attention-based models so that they can be applied to the Distant Speech Recognition (DSR) task. Specifically, we propose an end-to-end attention-based speech recognizer with multichannel input that performs sequence ...
Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks
Long Short-Term Memory (LSTM) recurrent neural networks have proven effective in modeling speech and have achieved outstanding performance in both speech enhancement (SE) and automatic speech recognition (ASR). To further improve the performance of noise-robust speech recognition, earlier work showed that combining speech enhancement and recognition is promising. This paper aims to expl...
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieving this archive, is one of the key issues for this broadcasting...
Empirical Exploration of Novel Architectures and Objectives for Language Models
While recurrent neural network language models based on Long Short-Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks, Convolutional Neural Network (CNN) language models are relatively new and have not been studied in depth. In this paper we present an empirical comparison of LSTM and CNN language models on English broadcast news and various conversational telep...
Multi-Channel Speech Recognition: LSTMs All the Way Through
Long Short-Term Memory recurrent neural networks (LSTMs) have demonstrable advantages on a variety of sequential learning tasks. In this paper we demonstrate an LSTM “triple threat” system for speech recognition, where LSTMs drive the three main subsystems: microphone array processing, acoustic modeling, and language modeling. This LSTM trifecta is applied to the CHiME-4 distant recognition cha...